See the RMD file and the data sets of this homework from here.
Please submit your homework in PDF format with the filename “Lastname_Firstname_hw3.pdf”, on or before Sunday 24 March 2024. You may want to knit the Rmd file to HTML first, then “Open in Browser” and print as PDF.
Place your R code and comments in the chunk after each question. Put additional text outside the chunk. For example, your interpretation on the test results.
For practice purpose, it’s not recommended to use packages that were not introduced in the class.
Read in the states.txt
file into a data frame as
described.
murder_lowincome
containing murder rates for just those
states with per capita incomes (the 3rd column) less than the median per
capita income. Similarly, extract a vector called
murder_highincome
containing murder rates for just those
states with greater than (or equal to) the median per capita income. Run
a two-sample t.test()
to determine whether the mean murder
rates are different between these two groups.# Read table
states <- read.table("states.txt", header = TRUE, sep = "\t")
# Calculate the median income
median_income <- median(states$income)
# Calculate the murder rates for just those states with per capita incomes less and greater than the median per capita income
murder_lowincome <- states$murder[states$income < median_income]
murder_highincome <- states$murder[states$income >= median_income]
# Run t-test
t.test(murder_lowincome, murder_highincome)
##
## Welch Two Sample t-test
##
## data: murder_lowincome and murder_highincome
## t = 1.2364, df = 46.494, p-value = 0.2225
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8058706 3.3738706
## sample estimates:
## mean of x mean of y
## 8.020 6.736
Solution: The p-value is 0.2225 (larger than 0.05), do not reject the H0 hypothesis. There is no significant evidence to indicate that the mean murder rates are different between these two groups.
[<row_selector>, <column_selector>]
to extract
a new data frame called states_name_pop
containing only the
columns for name and population. Extract another data frame called
states_gradincome_high
containing all columns, and all rows
where the income is greater than the median of income or where hs_grad
is greater than the median of hs_grad. (There are 35 such states listed
in the table.)# Extract the column `name` and `population`
states_name_pop <- states[, c("name", "population")]
# Extract states with above median withdrawal income or high school graduation rates
median_income <- median(states$income)
median_hs_grad <- median(states$hs_grad)
states_gradincome_high <- states[states$income > median_income | states$hs_grad > median_hs_grad, ]
states_by_region_income
, where
the rows are ordered first by region and second by income. (Hint: You
may want to use the order()
function here. The
order()
function can take multiple parameters as in
order(vec1, vec2)
, which considered in turn in determining
the ordering. See more examples from here# Sorted by region and income
states_by_region_income <- states[order(states$region, states$income), ]
normalize_mean_sd()
that takes
such a vector and returns the normalized version. The function should
work even if any values are NA
(the normalized version of
NA
should simply be NA). To “normalize” a vector of
numbers, first subtract the mean from each number and then divide each
by the standard deviation of the numbers. Use vector like
sample <- c(3.2, 5.1, 2.4, 1.6, NA, 7.9)
to test your
function.# Normalisation
normalize_mean_sd <- function(x) {
mean_x <- mean(x, na.rm = TRUE)
sd_x <- sd(x, na.rm = TRUE)
(x - mean_x) / sd_x
}
sample <- c(3.2, 5.1, 2.4, 1.6, NA, 7.9)
normalized_sample <- normalize_mean_sd(sample)
normalized_sample
## [1] -0.3335277 0.4208802 -0.6511732 -0.9688186 NA 1.5326393
apply()
or another function from the apply family,
together with the function normalize_mean_sd()
that you
created, to normalize the data columns in states.txt
. Then
write the new table to a new file called
states_normalized.txt
.# Normalize the data columns `population`, `income`, `murder` and `hs_grad`
normalized_data <- as.data.frame(lapply(states[, c("population", "income", "murder", "hs_grad")], normalize_mean_sd))
states_normalized <- cbind(states[, c("name", "region")], as.data.frame(normalized_data))
# Write to new table
write.table(states_normalized, "states_normalized.txt", sep = "\t", row.names = FALSE)
head(states_normalized)
The data file ozone.csv
was obtained from the
supplementary data of
Biostatistics:
A Methodology for the Health Sciences, describing the weather
conditions in New York City in 1973. Full description available
here.
breaks
and freq
arguments
to create 20 bins and display density rather than frequency. For the box
plot, try the rainbow
function; the colors are not
necessary the same as in the figure below. The las argument changes the
label orientation. See ?par
. Look at the arguments to
boxplot to see how to change the names printed under each box.# Read the dataset Ozone
ozone <- read.csv("ozone.csv")
par(mfrow = c(1, 3))
# Scatter plot of Solar Radiation against Ozone
plot(ozone$Ozone, ozone$Solar.R, main = "Relationship between\nsolar radiation and ozone level",
xlab = "Ozone level", ylab = "Solar Radiation", pch = 16, col = "#fea400")
# Histogram of Wind Speed
hist(ozone$Wind, breaks = 20, freq = FALSE, main = "Distribution of Wind speed",
xlab = "Wind Speed", col = "#9f1fef")
# Boxplot of Ozone level per month
ozone$Month <- factor(ozone$Month, levels = 5:9, labels = c("May", "Jun", "Jul", "Aug", "Sep"))
boxplot(ozone$Ozone ~ ozone$Month, xlab = "Month", ylab = "Ozone", main = "Distribution of Ozone per-month",
col = c("#fe0000", "#cbfe00", "#00fe65", "#0065fe", "#9f1fef"), las = 2)
par
function.
Then plot Ozone versus Solar Radiation, Wind Speed and Temperature on
separate graphs. Use different colors and plotting characters on each
plot. At last, save the plot to a pdf. HINT: Create the graph first in
RStudio. When you’re happy with it, re-run the code preceded by the pdf
function to save to a file. Don’t forget to use dev.off() to close the
file.par(mfrow = c(1, 3))
# Scatter plot of Ozone against Solar Radiation
plot(ozone$Solar.R, ozone$Ozone, main = "Relationship between\nsolar radiation and ozone level",
xlab = "Solar Radiation", ylab = "Ozone level", pch = 16, col = "#8fed8f")
# Scatter plot of Ozone against Wind Speed
plot(ozone$Wind, ozone$Ozone, main = "Relationship between\nwind speed and ozone level",
xlab = "Wind Speed", ylab = "Ozone level", pch = 15, col = "#4581b3")
# Scatter plot of Ozone against Temperature
plot(ozone$Temp, ozone$Ozone, main = "Relationship between\ntemperature and ozone level",
xlab = "Temperature", ylab = "Ozone level", pch = 17, col = "#fea400")
# Save to pdf
dev.copy(pdf, "ozone_plots.pdf", width = 10, height = 3)
## pdf
## 3
dev.off()
## png
## 2
# Find the outliers
outliers <- ozone$Ozone > 100
# Create a blank plot
plot(0, 0, type = "n", xlim = range(ozone$Temp, na.rm = TRUE), ylim = range(ozone$Ozone, na.rm = TRUE),
xlab = "Temperature", ylab = "Ozone level", main = "Relationship between temperature and ozone level")
# Plot the outliers
points(ozone$Temp[outliers], ozone$Ozone[outliers], pch = 17, col = "#fe0000")
# Plot the non-outliers
points(ozone$Temp[!outliers], ozone$Ozone[!outliers], pch = 17, col = "#fea400")
# Add a legend
legend("topleft", legend = c("Ozone > 100", "Normal Ozone"), col = c("#fe0000", "#fea400"), pch = 17)
# Add a dividing line
abline(h = 100, lty = 2)
library(ggplot2)
library(gridExtra)
# Scatter plot of Ozone against Solar Radiation
p1 <- ggplot(ozone, aes(x = Solar.R, y = Ozone)) +
geom_point(pch = 16, color = "#8fed8f") +
labs(x = "Solar Radiation", y = "Ozone level") +
theme_bw()
# Scatter plot of Ozone against Wind Speed
p2 <- ggplot(ozone, aes(x = Wind, y = Ozone)) +
geom_point(pch = 15, color = "#4581b3") +
labs(x = "Wind Speed", y = "Ozone level") +
theme_bw()
# Scatter plot of Ozone against Temperature
p3 <- ggplot(ozone, aes(x = Temp, y = Ozone)) +
geom_point(pch = 17, color = "#fea400") +
labs(x = "Temperature", y = "Ozone level") +
theme_bw()
plot_grid <- grid.arrange(p1, p2, p3, ncol = 3, nrow = 1)
# Save to pdf
ggsave("ozone_plots_ggplot2.pdf", plot = plot_grid, device = "pdf", width = 10, height = 3)
The gene expression data collected by Golub et al. (1999) are among
the classical in bioinformatics. The data are stored in
golub.txt
, containing gene expression values of 3051 genes
(rows) from 38 leukemia patients (columns). Twenty-seven patients
(column 1 to 27) are diagnosed as acute lymphoblastic leukemia (ALL) and
eleven (column 28 to 38) as acute myeloid leukemia (AML). The tumor
class of ALL is 0 (negative), while the tumor class of AML is 1
(positive).
The important gene CD33 is among one of the investigated genes. It has its expression values in row 808 of the Golub data. Suppose that normality of the ALL and AML expression values has been validated and assume equal variance. Test the equality of the means by an appropriate test about gene CD33. Formulate the null hypothesis, the p-value and your conclusion.
# Read the dataset
golub <- read.table("golub.txt", header = TRUE)
# Extract the expression values of gene CD33
irow = 808
cd33_all <- golub[irow, 1:27]
cd33_aml <- golub[irow, 28:38]
# Perform t-test
t.test(cd33_all, cd33_aml, var.equal = TRUE)
##
## Two Sample t-test
##
## data: cd33_all and cd33_aml
## t = -7.9813, df = 36, p-value = 1.773e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.5487898 -0.9211602
## sample estimates:
## mean of x mean of y
## -0.8812041 0.3537709
Solution:
Conclusion: The p-value is 1.773e-09 (smaller than 0.05), reject the H0 hypothesis. The means of the gene CD33 expression values in lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) patients are significantly different.
Researchers conducted a study to investigate the relationship between a specific genetic mutation (A) and susceptibility to a rare neurological disorder (N). They collected data from a cohort of 150 individuals and summarized it in a 2x2 contingency table as follows:
Disorder (N)
Present Absent
--------- ------
Mutation (A) 25 60
No Mutation (A) 30 35
Perform a categorical analysis on this data using R and determine whether there is a significant association between the genetic mutation (A) and the occurrence of the disease (D) at a significance level of 0.05. Clearly state your steps of hypothesis testing to get full marks.
# Create the contingency table
table <- matrix(c(25, 60, 30, 35), nrow = 2, byrow = TRUE)
dimnames(table) <- list(Mutation = c("A", "No A"), Disorder = c("Present", "Absent"))
# Perform the chi-square test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 3.7541, df = 1, p-value = 0.05268
Solution: The p-value is 0.05268 (larger than 0.05), do not reject the H0 hypothesis. There is no significant association between the genetic mutation (A) and the occurrence of the disease (D) at a significance level of 0.05.
//
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Asia/Hong_Kong
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.3 knitr_1.49 rlang_1.1.4
## [5] xfun_0.48 generics_0.1.3 textshaping_0.4.0 jsonlite_1.8.9
## [9] labeling_0.4.3 glue_1.8.0 colorspace_2.1-1 htmltools_0.5.8.1
## [13] ragg_1.3.3 sass_0.4.9 fansi_1.0.6 scales_1.3.0
## [17] rmarkdown_2.29 grid_4.4.1 evaluate_1.0.1 munsell_0.5.1
## [21] jquerylib_0.1.4 tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10
## [25] lifecycle_1.0.4 compiler_4.4.1 dplyr_1.1.4 pkgconfig_2.0.3
## [29] rstudioapi_0.17.1 systemfonts_1.1.0 farver_2.1.2 digest_0.6.37
## [33] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0
## [37] magrittr_2.0.3 bslib_0.8.0 withr_3.0.2 tools_4.4.1
## [41] gtable_0.3.6 cachem_1.1.0